DT-1342: Increase (revert back) TDR tools pool size from 1000 to 2000 #418

Merged: pshapiro4broad merged 1 commit into master from ps/dt-1342-increase-datarepo-tools-pool on Mar 17, 2025

@pshapiro4broad pshapiro4broad commented Mar 11, 2025

Increase the datarepo tools pool size to prevent tests from blocking or failing unnecessarily.

This value was changed from 1500 to 2000: #293
And then was changed from 2000 to 1000: #374

The usage of this pool depends heavily on how much work is being done on TDR. If there are more than three developers creating one or two branches each, a pool size of 1000 can be exhausted due to concurrent test execution. In addition to more activity, recent changes to allow integration tests to run in parallel, and to avoid needing to "lock" an integration host to run tests on, may have put a greater burden on this resource.

Here's the pool usage graph for the last 7 days:
[image: pool usage graph]

The exhaustion spike occurred yesterday, and the capacity dipped down to 15% last week as well, so this usage doesn't seem atypical. Grafana's data only goes back 10 days, so there's no easy way to see long-term trends.

@pshapiro4broad pshapiro4broad requested a review from a team as a code owner March 11, 2025 15:09
@pshapiro4broad pshapiro4broad requested review from davidangb, marctalbott and snf2ye and removed request for a team March 11, 2025 15:09
@fboulnois fboulnois left a comment

👍

@davidangb davidangb left a comment

👍 as long as Justin is ok with it

jyang-broad commented Mar 17, 2025

I'm not necessarily against this if it's breaking, but I do want to pause a moment to diagnose what's causing this. According to Phil's screenshot, as well as one I took below:
[image: Screenshot 2025-03-17 at 09 28 35]
It looks like there's a noted drop in buffer availability at 3pm ET on Mondays, followed by a lesser one on Tuesday. Is there some sort of scheduled testing that happens weekly?

If this is totally normal activity, then that's fine, but it's a little notable that this seems to look like a strong spike in testing that extends past 5pm.

pshapiro4broad commented Mar 17, 2025

> It looks like there's a noted drop in buffer availability at 3pm ET on Mondays, followed by a lesser one on Tuesday. Is there some sort of scheduled testing that happens weekly?
>
> If this is totally normal activity, then that's fine, but it's a little notable that this seems to look like a strong spike in testing that extends past 5pm.

I think what you're seeing is an artifact of how developers typically work. Each usage spike correlates directly with tests being run. It's likely that many developers push changes before they sign off for the day, which causes the spike to extend past 5pm. When the tests run normally, they take about an hour. With retries they can take longer; in the worst case, the RBS pool is exhausted and the tests wait for resources to become ready, up to the 90-minute test timeout.

You can see the history of the tests here https://github.com/DataBiosphere/jade-data-repo/actions/workflows/int-and-connected-test-run.yml

And the nightly test configuration is

  schedule:
    - cron: '0 4 * * *' # run at 4 AM UTC (11 PM EST / 12 AM EDT)
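
To double-check which US Eastern wall-clock time the `0 4 * * *` schedule corresponds to, here is a minimal Python sketch. It uses fixed UTC offsets for EST and EDT rather than a timezone database, and the helper name `cron_hour_local` is hypothetical, not something from the repo.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration; a real check would use a tz database.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def cron_hour_local(utc_hour: int, tz: timezone) -> datetime:
    """Return the local time at which a daily cron set to utc_hour (UTC) fires."""
    # The calendar date is arbitrary; only the hour conversion matters here.
    fired_utc = datetime(2025, 3, 17, utc_hour, 0, tzinfo=timezone.utc)
    return fired_utc.astimezone(tz)


for tz in (EST, EDT):
    print(cron_hour_local(4, tz).strftime("%H:%M %Z"))
# prints "23:00 EST" and then "00:00 EDT"
```

Either way, the nightly run starts around midnight Eastern, so it would not explain an afternoon dip in pool availability.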

TL;DR as far as I can tell this is normal (expected) activity for this pool usage.

@jyang-broad jyang-broad left a comment

Audited the test runs. It appears that one test run uses roughly 10% of the pool, which is about 100 workspaces (I believe I've heard this number is actually about 150, but it is mitigated by workspace recovery during the test). This occurs over roughly 20 minutes, and Buffer takes about 1 hour to recover those 100 workspaces.

So about 7-10 test runs over 1-5 hours would bankrupt the pool, which is what I saw.

I expect that bumping this should allow for 14-20 test runs over 5 hours.
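
As a rough check of these numbers, here is a minimal Python sketch of the arithmetic. It assumes each run costs between 100 and 150 workspaces (the two figures mentioned above) and ignores Buffer refilling the pool between runs; the function name `runs_to_exhaust` is hypothetical.

```python
import math


def runs_to_exhaust(pool_size: int, workspaces_per_run: int) -> int:
    """Number of test runs needed to drain the pool, ignoring any refill."""
    return math.ceil(pool_size / workspaces_per_run)


# Per-run cost is an assumption taken from the audit: ~100 net, ~150 gross.
for pool_size in (1000, 2000):
    low = runs_to_exhaust(pool_size, 150)
    high = runs_to_exhaust(pool_size, 100)
    print(f"pool={pool_size}: exhausted after {low}-{high} runs")
# prints "pool=1000: exhausted after 7-10 runs"
# then   "pool=2000: exhausted after 14-20 runs"
```

Because refill is ignored, this is a lower bound on how many runs the pool can absorb; runs spread over several hours would give Buffer time to recover some workspaces.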

@pshapiro4broad pshapiro4broad merged commit e59e3f6 into master Mar 17, 2025
6 checks passed
@pshapiro4broad pshapiro4broad deleted the ps/dt-1342-increase-datarepo-tools-pool branch March 17, 2025 18:11